20/5/1 (Item 1 from file: 8) 

DIALOG (R) File 8:Ei Compendex (R) 

(c) 2004 Elsevier Eng. Info. Inc. All rts. reserv. 

06021105 E.I. No: EIP02126889975 

Title: A technique for high bandwidth and deterministic low latency 
load/store accesses to multiple cache banks 

Author: Neefs, Henk; Vandierendonck, Hans; De Bosschere, Koen 

Corporate Source: Dept. of Electronics and Info. Syst. University of 
Gent, 9000 Gent, Belgium 

Conference Title: Sixth International Symposium on High-Performance 
Computer Architecture 

Conference Location: Toulouse, France Conference Date: 20000108-20000112 

Sponsor: IEEE Computer Society; Technical Committee on Computer 
Architecture 

E.I. Conference No.: 59047 

Source: IEEE High-Performance Computer Architecture Symposium Proceedings 
2000. p 313-324 

Publication Year: 2000 
CODEN: 85QSAT 
Language: English 

Document Type: CA; (Conference Article) Treatment: T; (Theoretical) 
Journal Announcement: 0203W4 

Abstract: One of the problems in future processors will be the resource 
conflicts caused by several load/store units competing to access the same 
cache bank. The traditional approach for handling this case is by 
introducing buffers combined with a cross-bar. This approach suffers from 
(i) the non-deterministic latency of a load/store and (ii) the extra 
latency caused by the cross-bar and the buffer management. A deterministic 
latency is of the utmost importance for the forwarding mechanism of 
out-of-order processors because it enables back-to-back operation of 
instructions. We propose a technique by which we eliminate the buffers and 
crossbars from the critical path of the load/store execution. This results 
in a latency that is both low and deterministic. Our solution consists of 
predicting which bank is to be accessed. A penalty is incurred only when the 
prediction is wrong. 13 Refs. 
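
The bank-prediction idea in the abstract above can be illustrated with a small sketch. The abstract does not say how the bank is predicted, so the per-instruction "last bank" table, the table size, the line-interleaved four-bank layout, and the latency and penalty values below are all assumptions chosen for illustration, not the authors' design.

```python
# Hypothetical sketch: a per-instruction "last bank" predictor for a banked cache.
# The prediction scheme and every constant here are illustrative assumptions.

NUM_BANKS = 4            # assumed number of cache banks
LINE_BYTES = 32          # assumed line size; banks are interleaved by line
TABLE_SIZE = 256         # assumed number of predictor entries
HIT_LATENCY = 2          # assumed cycles when the predicted bank is correct
MISPREDICT_PENALTY = 1   # assumed extra cycles when the prediction is wrong


def bank_of(address: int) -> int:
    """Return the bank that actually holds the address."""
    return (address // LINE_BYTES) % NUM_BANKS


class LastBankPredictor:
    """Predicts that a load/store touches the bank it touched last time."""

    def __init__(self) -> None:
        self.table = [0] * TABLE_SIZE

    def predict(self, pc: int) -> int:
        return self.table[pc % TABLE_SIZE]

    def update(self, pc: int, actual_bank: int) -> None:
        self.table[pc % TABLE_SIZE] = actual_bank


def access_latency(predictor: LastBankPredictor, pc: int, address: int) -> int:
    """Latency of one access: fixed on a correct prediction, plus a penalty otherwise."""
    predicted = predictor.predict(pc)
    actual = bank_of(address)
    predictor.update(pc, actual)
    if predicted == actual:
        return HIT_LATENCY
    return HIT_LATENCY + MISPREDICT_PENALTY
```

Under these assumptions, an instruction whose addresses stay in one bank always sees the fixed latency, while one that changes bank on every access pays the penalty each time the bank differs from the previous one.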

Abstract: In the pursuit of instruction-level parallelism, significant 
demands are placed on a processor's instruction delivery mechanism. 
Delivering the performance necessary to meet future processor execution 
targets requires that the performance of the instruction delivery 
mechanism scale with the execution core. Attaining these targets is a 
challenging task due to I-cache misses, branch mispredictions, and taken 
branches in the instruction stream. To counter these challenges, we 
present a fetch architecture that decouples the branch predictor from 
the instruction fetch unit. A Fetch Target Queue (FTQ) is inserted between 
the branch predictor and instruction cache. This allows the branch 
predictor to run far in advance of the address currently being fetched by 
the cache. The decoupling enables a number of architecture optimizations, 
including multilevel branch predictor design, fetch-directed instruction 
prefetching, and easier pipelining of the instruction cache. For the 
multilevel predictor, we show that it performs better than a single-level 
predictor, even when ignoring the effects of cycle-timing issues. We 
also examine the performance of fetch-directed instruction prefetching 
using a multilevel branch predictor and show that an average 19 percent 
speedup is achieved. In addition, we examine pipelining the instruction 
cache to achieve a faster cycle time for the processor pipeline and show 
that pipelining provides an average 27 percent speedup over not 
pipelining the instruction cache for the programs examined. 43 Refs. 
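
The decoupling described in the abstract above can be sketched as a bounded queue between a producer (the branch predictor) and a consumer (the instruction cache). The class and function names, the one-target-per-cycle rates, and the way cache stalls are modelled are assumptions for illustration only; the abstract does not specify them.

```python
from collections import deque

# Hypothetical sketch of a Fetch Target Queue (FTQ). Names, rates, and the
# stall model are illustrative assumptions, not the authors' design.


class FetchTargetQueue:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = deque()

    def push(self, target: int) -> bool:
        """Branch predictor side: enqueue a predicted fetch target if there is room."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(target)
        return True

    def pop(self):
        """Instruction cache side: dequeue the next target to fetch (None if empty)."""
        return self.entries.popleft() if self.entries else None

    def prefetch_candidates(self) -> list:
        """Targets queued behind the head; these could drive fetch-directed prefetching."""
        return list(self.entries)[1:]


def simulate(predicted_targets, cache_busy_cycles, capacity=4):
    """Run the predictor ahead of the cache.

    The predictor offers one target per cycle; the cache consumes one target
    per cycle except on cycles listed in cache_busy_cycles (e.g. a miss).
    Returns (targets fetched in order, peak queue occupancy, cycles elapsed).
    """
    ftq = FetchTargetQueue(capacity)
    todo = deque(predicted_targets)
    fetched, peak, cycle = [], 0, 0
    while todo or ftq.entries:
        if todo and ftq.push(todo[0]):
            todo.popleft()
        peak = max(peak, len(ftq.entries))
        if cycle not in cache_busy_cycles:
            fetched.append(ftq.pop())
        cycle += 1
    return fetched, peak, cycle
```

In this sketch the predictor keeps filling the queue while the cache is stalled, which is the behaviour that lets it run ahead of the address currently being fetched.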

The clock distribution network and the generation circuitry are 
critical components of current synchronous digital systems and are known 
to consume more than a quarter of the power budget of existing 
microprocessors. A high-level clock energy model that captures both the 
dynamic and leakage power components is formulated. The validation results 
show an average deviation within 5% of circuit-level simulations. 

Further, Phase-Locked Loops (PLLs), which have been generally used in 
clock generation, are also crucial for the implementation of Dynamic 
Voltage Scaling (DVS) mechanisms employed in emerging power conscious 
processor designs. In order to devise architectural and compiler driven 
optimizations that exploit the dynamic frequency voltage scaling features, 
accurate models that capture the performance and power characteristics of 
the PLL are essential. In addition, many emerging System-on-a-Chip (SOC) 
designs use multiple PLLs on the same die making it important to estimate 
the contribution of the PLL to the overall system power. A PLL energy and 
timing model that accurately estimates the power consumption during both 
lock and lock-acquisition states is also formulated. The applicability of 
PLLs as voltage regulators in support of leakage reduction by supply gating 
is briefly discussed. 

The complete clock energy model is incorporated into a cycle-accurate 
energy simulator for an embedded architecture. This framework is used to 
study and quantify the influence on clock energy of several 
architectural-level decisions and their relative impact on the overall 
system power. These design choices include various cache architectures 
and clock gating at different levels (top-level distribution network, 
functional unit, and gate level). From the software perspective, the 
influence on clock energy of power-aware memory-oriented compiler 
optimizations is assessed. 

Finally, the model is used to predict the role that the clock will 
have in the total power budget of future designs while carefully capturing 
the impact of technology scaling. It is shown that as long as leakage power 
is kept under control, clock power will remain a significant contributor to 
the total system power. 

This dissertation introduces a reconfigurable stage buffer, a buffer 
with a variable size and parallel ports, that holds stored data 
temporarily while the central processing unit writes data to its 
multilevel cache. A corresponding analytic model is given which leads to 
a reconfigurable stage buffer design theory. 

Because new generation RISC (Reduced Instruction Set Computing) 
architectures permit memory accesses without preserving program order in 
general, the reconfigurable stage buffer uses the CPU's store operation 
reordering, synchronization instructions and its reconfigurability to 
maximize the store parallelism. Therefore, the reconfigurable stage buffer 
is able to overcome the difficulty that a conventional, fixed-configuration 
stage buffer has in accommodating the data traffic from the CPU under 
varying workloads. 

The reconfigurable stage buffer consists of several independent 
parallel buffers which can be implemented either concurrently or 
sequentially depending on upcoming workload patterns. There are 
software-controlled signals associated with the reconfigurable stage 
buffer. According to the CPU's synchronization instructions, data 
ownership, data modification status, and access atomicity, the 
reconfigurable stage buffer decides whether or not to change its 
configuration. The reconfigurable stage buffer allows great flexibility 
in store implementation and improves performance while preserving just 
enough order that desired program behavior can be guaranteed. 

A theory supporting the proposed design is presented in which an 
analytic model is introduced that allows for convenient design and analysis 
of the reconfigurable stage buffer. The model predicts the average 
traffic versus the number of independent parallel buffers and the size of 
the independent parallel buffers. The model also indicates the impact of 
the cache architecture on the reconfigurable stage buffer design. The 
prediction indicates the potential traffic congestion in the 
reconfigurable stage buffer and describes how to avoid traffic congestion. 
By using the analytic model, we can roughly evaluate reconfigurable stage 
buffer performance and robustness within a short period of time and at 
low cost. 

An example, the PowerPC, illustrates practical use of the 
reconfigurable stage buffer architecture and provides a corresponding logic 
design. The reconfigurable stage buffer logic design includes the details 
of Tag and Status Interfaces, Data and Address Interfaces, and Control 
Logic and Buffer Register Arrangement. It is shown that the proposed 
reconfigurable stage buffer reduces the number of CPU wait states and 
alleviates store data congestion. 

Abstract: Scaling devices while maintaining reasonable short channel 
immunity requires gate oxide thickness of less than 20 Å for CMOS devices 
beyond the 70 nm technology node. Low oxide thickness gives rise to 
considerable direct tunneling current (gate leakage). Power dissipation in 
large caches is dominated by the gate and sub-threshold leakage current. 
This paper proposes a novel cache that has high noise immunity with 
improved leakage power. For every bank of SRAM cells, this technique 
requires an extra diode in parallel with a gated-ground transistor 
connected between the source of NMOS transistors and ground in SRAM cells. 
The row decoder itself can be used to control the extra gated-ground 
transistor. Our simulation results on a 70 nm process (Berkeley Predictive 
Technology Model augmented with our gate leakage model) show 39.2% 
reduction in consumed energy (leakage plus dynamic) in L1 cache and 59.4% 
reduction in L2 cache energy with less than 2.5% impact on execution time. 
The technique is applicable to data and instruction caches as well as 
different levels of cache hierarchy such as the L1, L2, or L3 caches. 
(15 Refs) 

Abstract: In this paper, we revisit the problem of performance 
prediction on shared memory parallel machines, motivated by the need 
for selecting a parallelization strategy for random write reductions. Such 
reductions frequently arise in data mining algorithms. In our previous 
work, we have developed a number of techniques for parallelizing this class 
of reductions. Our previous work has shown that each of the three 
techniques, full replication, optimized full locking, and cache-sensitive, 
can outperform others depending upon problem, dataset, and machine 
parameters. Therefore, an important question is, "Can we predict the 
performance of these techniques for a given problem, dataset, and 
machine?". This paper addresses this question by developing an analytical 
performance model that captures a two-level cache, coherence cache 
misses, TLB misses, locking overheads, and contention for memory. 
The analytical model is combined with results from micro-benchmarking to 
predict performance on real machines. We have validated our model on two 
different SMP machines. Our results show that our model effectively 
captures the impact of memory hierarchy (two-level cache and TLB) as 
well as the factors that limit parallelism (contention for locks, memory 
contention, and coherence cache misses). The difference between predicted 
and measured performance is within 20% in almost all cases. Moreover, the 
model is quite accurate in predicting the relative performance of the 
three parallelization techniques. (22 Refs) 

Abstract: In this paper, we quantify the performance of a novel family of 
multi-stage two-level adaptive branch predictors. In each two-level 
predictor, the PHT of a conventional two-level adaptive branch predictor 
is replaced by a prediction cache. Unlike a PHT, a prediction cache 
saves only relevant branch prediction information. Furthermore, 
predictions are never based on uninitialised entries and interference 
between branches is eliminated. In the case of a prediction cache miss in 
the first stage, our two-stage predictors use a default two-bit 
prediction counter stored in a second stage. We demonstrate that a 
two-stage cached predictor is more accurate than a conventional two-level 
predictor and quantify the crucial contribution made by the second 
prediction stage in achieving this high accuracy. We then extend our 
cached predictor by adding a third stage and demonstrate that a 
three-stage cached predictor further improves the accuracy of cached 
predictors. (16 Refs) 

Abstract: During the 1990s Two-level Adaptive Branch Predictors were 
developed to meet the requirement for accurate branch prediction in 
high-performance superscalar processors. However, while two-level adaptive 
predictors achieve very high prediction rates, they tend to be very 
costly. In particular, the size of the second level Pattern History Table 
(PHT) increases exponentially as a function of history register length. 
Furthermore, many of the prediction counters in a PHT are never used; 
predictions are frequently generated from non-initialised counters and 
several branches may update the same counter, resulting in interference 
between branch predictions. In this paper, we propose a Cached 
Correlated Two-Level Branch Predictor in which the PHT is replaced 
by a Prediction Cache. Unlike a PHT, the Prediction Cache saves only 
relevant branch prediction information. Furthermore, predictions are 
never based on uninitialised entries and interference between branches is 
eliminated. We simulate three versions of our Cached Correlated Branch 
Predictors. The first predictor is based on global branch history 
information while the second is based on local branch history 
information. The third predictor exploits the ability of cached 
predictors to combine both global and local history information in a 
single predictor. We demonstrate that our predictors deliver higher 
accuracy than conventional predictors at a significantly lower cost. (13 
Refs) 
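
To make the exponential growth mentioned in the abstract above concrete, the following sketch computes the storage of a single pattern history table. It assumes one table of two-bit counters with one counter per history pattern; the function name is hypothetical, and other PHT organisations would scale differently.

```python
# Hypothetical helper: storage of a PHT with one counter per history pattern.
# The single-table layout and two-bit counters are illustrative assumptions.


def pht_storage_bits(history_length: int, counter_bits: int = 2) -> int:
    """Return the number of bits needed for 2**history_length counters."""
    if history_length < 0:
        raise ValueError("history_length must be non-negative")
    return counter_bits * (2 ** history_length)


for length in (8, 12, 16):
    print(length, pht_storage_bits(length))
```

Each extra history bit doubles the table, so under these assumptions a 16-bit history needs 256 times the storage of an 8-bit history.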

Abstract: The time a program takes to execute is significantly affected 
by the efficiency with which it utilises cache memory. Moreover, the cache 
miss behaviour of a program is highly unstable, in that small changes to 
input parameters can cause large changes in the number of misses. In this 
paper we describe novel analytical methods of predicting the cache miss 
ratio of numerical programs, for sequential hierarchies of set-associative 
caches. The methods are demonstrated to be applicable to most loop nests. 
They are also shown to be highly accurate, yet able to be evaluated orders 
of magnitude faster than a comparable simulation. (12 Refs) 

Treatment: Practical (P) 

Abstract: Presents an instruction cache prefetching mechanism that is 
capable of prefetching past branches in multiple-issue processors. At 
high clock rates, such processors often use small instruction caches which 
have significant miss rates. Prefetching from a secondary cache can hide 
the instruction cache miss penalties, but only if initiated sufficiently 
far ahead of the current program counter (PC). Existing instruction cache 
prefetching methods are strictly sequential and cannot do that, due to 
their inability to prefetch past branches. By keeping branch history 
and branch target addresses, we predict a future PC several branches 
past the current branch. We describe a possible prefetching architecture 
and evaluate its accuracy, the impact of the instruction prefetching on 
performance, and its interaction with sequential prefetching. For a 4-issue 
processor and a cache architecture patterned after the DEC Alpha-21164, we 
show that our prefetching unit can be more effective than sequential 
prefetching. The two types of prefetching eliminate different types of 
misses and can thus be effectively combined to achieve better performance. 
(28 Refs) 

Abstract: Faster processors are used to tackle larger problems which 
typically require a larger memory. Unfortunately this prohibits memory 
access latency from scaling with processor speed. Consequently, multiple 
levels of caching are employed which utilise temporal and spatial 
locality of reference to bridge the performance gap. However, cache 
performance is difficult to predict which is problematic for hard 
real-time systems. A tree memory structure, whose access frequency, rather 
than latency, can scale with processor speed, is proposed, together with a 
scalable memory module base virtual addressing mechanism and page based 
memory protection using capabilities. It is concluded that a multi-threaded 

Abstract: The "high-end" water-cooled processors in the IBM Enterprise 
System/9000 (TM) product family use a CPU organization and cache 
structure which depart significantly from previous designs. The CPU 
organization includes multiple execution elements which execute 
instructions out of sequence, and uses a new virtual register 
management algorithm to control them. It also contains a branch 
history table to remember recent branches and their target addresses 
so that instruction fetching and decoding can be directed more 
accurately. These models also use a two-level cache structure 
which provides a level 1 cache associated with each processor and a 
level 2 cache associated with central storage. The level 1 cache uses a 
store-through organization, and is split into two separate caches, one 
used for instruction fetching and the other for operand references. The 
level 2 cache uses a store-in method to handle stores. 
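
The difference between the store-through level 1 cache and the store-in level 2 cache described above can be sketched as follows. This is a generic illustration of the two write policies, not the ES/9000 implementation; the class names, the tiny capacity, the unbounded level 1 cache, and FIFO eviction are all assumptions.

```python
# Hypothetical sketch of store-through (write-through) versus store-in
# (write-back) behaviour. Sizes and eviction policy are illustrative assumptions.


class CentralStorage:
    def __init__(self) -> None:
        self.data = {}
        self.writes = 0   # number of writes that reached central storage

    def write(self, address: int, value: int) -> None:
        self.data[address] = value
        self.writes += 1


class StoreInL2:
    """Keeps stored data in the cache; writes it back only on eviction."""

    def __init__(self, storage: CentralStorage, capacity: int = 2) -> None:
        self.storage = storage
        self.capacity = capacity
        self.lines = {}   # address -> (value, dirty flag)

    def store(self, address: int, value: int) -> None:
        if address not in self.lines and len(self.lines) >= self.capacity:
            self._evict()
        self.lines[address] = (value, True)

    def _evict(self) -> None:
        victim = next(iter(self.lines))        # assumed FIFO victim choice
        value, dirty = self.lines.pop(victim)
        if dirty:
            self.storage.write(victim, value)  # write-back happens here


class StoreThroughL1:
    """Updates its own copy and forwards every store to the next level."""

    def __init__(self, next_level: StoreInL2) -> None:
        self.next_level = next_level
        self.lines = {}   # capacity left unbounded to keep the sketch short

    def store(self, address: int, value: int) -> None:
        self.lines[address] = value
        self.next_level.store(address, value)
```

In this sketch repeated stores to the same address reach the level 2 cache every time but reach central storage only once, when the line is evicted.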

ABSTRACT: 
In order to meet the computational needs of the next decade, shared-memory 
processors must be scalable. Though single shared-bus architectures have 
been successful in the past, lack of bus bandwidth restricts the number 
of processors that can be effectively put on a single bus machine. One 
architecture that has been proposed to solve the limited bandwidth problem 
consists of processors connected via a tree hierarchy of buses. In this 
paper, the authors present a tool to study a hierarchical bus based 
shared-memory system. The authors highlight the main features of a 
hierarchical cache coherence protocol and give some preliminary performance 
results obtained via an instruction level simulator. 